Transfer Learning: From a Translation Model to a Dense Sentence Representation with Application to Paraphrase Detection
Abstract
It is becoming increasingly difficult to judge the originality of digital content as the corpus of digital text continues to grow. Traditional methods such as n-gram overlap are susceptible to simple obfuscation techniques. In this paper, we investigate several methods of measuring sentence similarity, with particular emphasis on paraphrase detection. We show that a bag-of-vectors approach provides a simple yet effective baseline. We then demonstrate how the encoder from a neural machine translation model can be used to build a powerful paraphrase detection classifier.
Duplication of digital text, whether accidental or deliberate, is a major concern for the academic and business communities alike. Traditional plagiarism detection methods, such as n-gram overlap, are susceptible to simple obfuscation techniques (Potthast et al., 2010). A more robust method for comparing the semantic content of sentences is needed. Recently, researchers have shown that dense representations of words and sentences, combined with deep learning techniques, can provide impressive results on a number of natural language processing (NLP) tasks (Socher et al., 2011; Wu et al., 2016).
A major roadblock in the development of paraphrase classification models is the lack of large, high-quality datasets. Most of the large paraphrase datasets have fewer than 20,000 sentence pairs (Dolan and Brockett, 2005; Xu et al., 2014). While sufficient for traditional machine learning techniques, the publicly available paraphrase datasets are not large enough to train complex deep learning models. The training difficulty arises because accurate paraphrase detection requires a certain level of semantic understanding. One solution is to apply transfer learning, where a natural language model is pretrained on a separate task before being applied to the target task (Pan and Yang, 2010).
Transfer learning has seen many applications in image processing, where the lower layers of a pretrained model are often used to initialize a new model (Oquab et al., 2014). In some cases the weights of the lower layers can be fixed while training the second model. A similar approach is applied in (Pan et al., 2010), where a denoising autoencoder is trained on a large dataset and then used for a sentiment classification task. In this work, we demonstrate how the encoder from a neural machine translation (NMT) model can be used to build a powerful paraphrase detection classifier. We start by constructing a simple bag-of-words baseline for the paraphrase detection task. We then demonstrate that the context vector from a translation model can be used to classify paraphrases with greater accuracy than the baseline. Finally, we combine the baseline model with the NMT model to generate an ensemble model that approaches state-of-the-art performance.
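The contrast between n-gram overlap and a dense bag-of-vectors baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional word vectors in `TOY_VECTORS` are hand-picked hypothetical values, and averaging word vectors followed by cosine similarity is assumed here as one common way to build a bag-of-vectors sentence representation. The helper names (`ngram_jaccard`, `sentence_vector`, `cosine`) are likewise illustrative.

```python
import math

def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=2):
    """Jaccard overlap between the word n-gram sets of two sentences."""
    ga = ngrams(a.lower().split(), n)
    gb = ngrams(b.lower().split(), n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Hypothetical toy embeddings, chosen by hand so that synonyms lie close
# together. A real system would load pretrained word vectors instead.
TOY_VECTORS = {
    "the": [0.1, 0.1, 0.1],
    "movie": [0.9, 0.1, 0.0],
    "film": [0.85, 0.15, 0.05],
    "was": [0.1, 0.2, 0.1],
    "great": [0.0, 0.9, 0.1],
    "excellent": [0.05, 0.95, 0.1],
}

def sentence_vector(sentence):
    """Average the vectors of all known words (assumed bag-of-vectors form)."""
    vecs = [TOY_VECTORS[t] for t in sentence.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

s1 = "the movie was great"
s2 = "the film was excellent"  # same meaning, two synonym substitutions

overlap = ngram_jaccard(s1, s2)
similarity = cosine(sentence_vector(s1), sentence_vector(s2))
print(f"bigram overlap: {overlap:.2f}, cosine similarity: {similarity:.2f}")
```

With these toy values, two synonym substitutions leave the sentences with no shared bigrams, while the averaged vectors remain nearly parallel. A classifier on top of such similarity scores would still need labeled paraphrase pairs to choose a decision threshold.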
Similar resources
Image Classification via Sparse Representation and Subspace Alignment
Image representation is a crucial problem in image processing, where there exist many low-level representations of images, e.g., SIFT, HOG and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation and principal component analysis are employed to d...
Towards Generalizable Sentence Embeddings
In this work, we evaluate different sentence encoders with emphasis on examining their embedding spaces. Specifically, we hypothesize that a “high-quality” embedding aids in generalization, promoting transfer learning as well as zero-shot and one-shot learning. To investigate this, we modify Skipthought vectors to learn a more generalizable space by exploiting a small amount of supervision. The...
English to Hindi Paraphrase Convention for Translating Homoeopathy Literature
The rule based approach to machine translation (MT) confines grammatical rules between the source and the target language with the goal of constructing grammatical translation between the language pair. In this paper, we describe the structural representation of English stemmer, POS tagging and design transfer rules which can generate Hindi sentence from the structural representation of the Eng...
Neural Paraphrase Generation using Transfer Learning
Progress in statistical paraphrase generation has been hindered for a long time by the lack of large monolingual parallel corpora. In this paper, we adapt the neural machine translation approach to paraphrase generation and perform transfer learning from the closely related task of entailment generation. We evaluate the model on the Microsoft Research Paraphrase (MSRP) corpus and show that the ...
Sparse Structured Principal Component Analysis and Model Learning for Classification and Quality Detection of Rice Grains
In scientific and commercial fields associated with modern agriculture, the categorization of different rice types and the determination of their quality are very important. Various image processing algorithms have been applied in recent years to detect different agricultural products. The problem of rice classification and quality detection in this paper is presented based on model learning concepts includ...
Publication date: 2017